Ultra High Resolution (UHR)-IonStar is an MS1-based quantitative method for label-free proteomics experiments, devised to address issues related to quantitative precision, missing data, and false-positive discovery of protein changes in large-cohort analysis.
UHR-IonStar comprises two parts: experimental procedures (left panel) and a proteomics data analysis pipeline (right panel). This manual provides a sample preparation protocol and focuses on the data analysis pipeline part of UHR-IonStar, with the aim of helping users run the pipeline in their own computational environment.
The following describes a general sample preparation protocol for UHR-IonStar. Details of the experimental procedures can be found in Shen et al. J Proteome Res. (2017) and An et al. Anal Chem. (2015).
Take 50 mg (or 100 mg if the protein yield is low) of tissue for the following analysis.
Weigh the tissue sample and record.
Prepare Lysis buffer with protease and phosphatase inhibitors (\(1\) tablet for \(10ml\) lysis buffer):
Add \(10\)x lysis buffer.
Vortex and spin.
For a small amount of tissue: homogenize for 30 s using a pellet pestle.
For tough tissue (e.g. skin): set the speed of the Polytron to \(15,000~rpm\) and homogenize the sample in \(5\)-second bursts, cooling between bursts, for \(5\) repeats. Between samples, wash the probe sequentially with methanol, water, and lysis buffer.
Keep the samples on ice for one hour.
Sonicate in \(8\) bursts with a high-energy sonicator (level \(14\)), cool for \(2\) seconds, and repeat \(3\) times. Rinse the probe between samples.
Keep the samples on ice for one hour or overnight to allow thorough protein extraction.
Centrifuge at \(20000\)g, \(4^\circ C\) for \(30\) minutes. Prepare a new set of EP tubes. Prepare standards for BCA.
Take the supernatant (extracted proteins) to the new set of EP tubes and measure protein concentration using BCA.
Take out \(100\mu g\) of protein per sample (according to the protein concentration measured by BCA) and dilute the solution to \(100\mu l\) with \(0.5\%\) SDS.
Prepare and add \(5\mu l\) DTT to each tube, vortex and spin, incubate at \(56^\circ C\) for \(30\)min.
Prepare and add \(5\mu l\) IAM to each tube, vortex and spin, incubate at \(37^\circ C\) for \(30\)min (IAM is light sensitive, make and use in the dark).
Add small volume (\(sample: acetone = 1:1\), around \(100\mu l\)) of chilled acetone (\(-20^\circ C\)) and vortex.
Add large volume (\(sample: acetone = 1: 4~to~5\), around \(500\mu l\)) of chilled acetone, vortex, then incubate at \(-20^\circ C\) for \(3\) hours or overnight.
Centrifuge the samples at \(20,000\)g for \(30\)min at \(4^\circ C\) and remove the supernatant: absorb the liquid carefully with Kimtech, add around \(700\mu l\) methanol to rinse the tube, discard the methanol, spin, and remove the remaining liquid with a loading tip.
Re-suspend the protein pellets in \(80\mu l\) Tris-FA (\(pH=8.5\)) buffer, sonicate gently to loosen the pellets.
Activate trypsin: add \(85\mu l\) of Tris-FA (\(pH=8.5\)) buffer to \(20\mu g\) trypsin (Sigma-Aldrich). If using multiple tubes of enzyme, combine all the fractions and mix well before adding to the samples.
Add activated trypsin (\(20\mu l\)) into the tubes at \(Substrate:Enzyme = 20\), then incubate for \(6\) hours or overnight at \(37^\circ C\).
Terminate the digestion by adding \(1\mu l\) FA (\(1\%\) v/v) to each tube, vortex, centrifuge at \(20,000\)g at \(4^\circ C\) for \(30\)min. Prepare a new set of sample vials.
Transfer about \(90\mu l\) of the supernatant to sample vial carefully for LC-MS analysis.
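The volume arithmetic behind the dilution and digestion steps above can be checked with a short calculation. The following is a minimal sketch in R; the function names and the BCA concentration of 4 µg/µl are hypothetical and serve only as an illustration.

```r
# Volume of extract that contains the target protein mass, and the volume of
# 0.5% SDS needed to reach the final volume (units: ug, ug/ul, ul).
dilution_volumes <- function(conc_ug_per_ul, target_ug = 100, final_ul = 100) {
  sample_ul <- target_ug / conc_ug_per_ul
  if (sample_ul > final_ul) {
    stop("Extract is too dilute to fit the target mass into the final volume")
  }
  c(sample_ul = sample_ul, sds_ul = final_ul - sample_ul)
}

# Substrate-to-enzyme ratio when the trypsin is dissolved in buffer and an
# aliquot of that solution is added to each sample.
substrate_enzyme_ratio <- function(protein_ug = 100, trypsin_ug = 20,
                                   buffer_ul = 85, aliquot_ul = 20) {
  enzyme_ug <- trypsin_ug / buffer_ul * aliquot_ul
  protein_ug / enzyme_ug
}

vols <- dilution_volumes(conc_ug_per_ul = 4)  # hypothetical BCA result
ratio <- substrate_enzyme_ratio()             # defaults follow the protocol
print(vols)
print(ratio)
```

With the protocol's own numbers, each \(20\mu l\) aliquot carries about \(4.7\mu g\) of trypsin, so the ratio comes out at 21.25, close to the stated value of 20.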
The primary software packages used in UHR-IonStar are SIEVE™ and UHR-IonStar.
SIEVE™ is commercial software from Thermo Fisher Scientific. The latest version of SIEVE™ is v2.2 SP2. Please contact Thermo Fisher Scientific for a quote. To ensure proper performance, we recommend running SIEVE™ on a PC with at least a 16-core processor and at least 192 GB of RAM.
The R Shiny web app UHR-IonStar, built by Dr. Qu’s lab under R version 3.6.2 and R Bioconductor version 3.10, can be downloaded here.
Protein identification can be performed with any database search engine and post-search processing tool. The final output is a so-called spectrum report containing the PSMs from all sample runs that pass the confidence threshold (e.g. FDR). The spectrum report can be exported from a number of software packages, e.g. Proteome Discoverer or Scaffold. The key information necessary for data integration includes the rawfile name and the MS2 scan number. The spectrum report must be in .csv format.
The protein identification workflow currently used by our group features database searching by MS-GF+, post-search processing by IDPicker, and spectrum report generation by IonStarSPG.R. Detailed instructions can also be found.
Quantitative feature generation in UHR-IonStar is accomplished by SIEVE™ v2.2 SP2 (Thermo Scientific), which integrates ChromAlign for global 3-D chromatographic alignment and a direct ion current extraction (DICE) method for feature extraction.
To start the quantitative feature generation analysis, open SIEVE™ and select File -> Create new experiment. On the Designate Experiment Type page, select the Experiment Type based on the study. For a case-control experiment, use Two Sample Differential Analysis; for a multi-condition experiment (3 or more conditions including control), use Control Compare Trend.
Drag all rawfiles into the Raw File Selection page.
For Two Sample Differential Analysis, assign Condition A and Condition B in the two boxes; for Control Compare Trend, put all conditions in the upper box and assign the control condition in the lower box.
Two Sample Differential Analysis:
Control Compare Trend:
A reference file also needs to be selected. In general, the reference file should provide the highest alignment scores for all sample runs. In most cases, it is recommended to start with a file from the middle of the LC-MS sequence as the reference.
The parameters that need to be modified include Frame Time Width (min) and M/Z Width (ppm). The current settings are based on a 3-hr nano RPLC gradient with a Thermo Orbitrap instrument at 120K MS1 resolution. Manual optimization based on the LC-MS method may help improve the performance of feature generation. All other parameters follow the default settings.
Check Generate all frames based upon all MS2 scan’s retention times and precursor M/Zs to maximize the number of quantitative features. Alternatively, users can assign Maximum Number of Frames and Peak Intensity Threshold.
After setting the method, finish the wizard and save the .sdb file.
For UHR-IonStar, users do not need to run the Identify process. In the SIEVE Parameters window, MaxThreads should be set according to the configuration of the computer running SIEVE™. For example, 6 to 8 threads are recommended for a PC with a 16-core processor and 192 GB RAM. If needed, PCAProcess can also be disabled to reduce the computational burden. Click the Update button to save the settings. Run Align (ChromAlign) first.
Upon finishing, alignment scores for all sample runs will be shown in the Alignment tab. Ideally, the majority of sample runs should have an alignment score of >0.8 to ensure the quality of quantitative feature generation. Change the reference file and rerun the ChromAlign process if the alignment scores are subpar (e.g. <0.7) for a large portion of the files.
To change the reference file, click the “…” button in the Rawfiles line. Change the reference file by checking a new rawfile. Rerun Align and check the alignment scores again. When finished, run Frame to perform the DICE process.
After feature generation, the .sdb file will contain all the quantitative features (i.e. frames) generated. For more detailed information about the use of SIEVE, please refer to the SIEVE User Guide.
After protein identification and quantitative feature generation, the R Shiny web app UHR-IonStar is used to integrate the spectrum report with the quantitative feature list and generate the final quantitative results. Procedures in this step include:
UHR-IonStar (Version 1.3.0) includes all the code formerly in IonStar_Run.R and IonStar_FrameGen.R in its first part, called IonStarStat. For an explanation of the original code, please go to the supplementary part or check here for the former source of IonStarStat.
UHR-IonStar runs under R version 3.6.2 with R Bioconductor version 3.10 and also requires RStudio.
Please open ‘Setup.R’ to install all packages that UHR-IonStar depends on.
The R console may show the prompt “Update all/some/none? [a/s/n]:”; please input “a”.
It may also ask “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”; it is OK to choose ‘no’.
Set the working directory to where UHR-IonStar is located, then install the package IonStarStat_0.1.4.tar.gz.
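The installation from the local archive, together with a quick check for missing dependencies, can also be scripted. The following is a minimal sketch, assuming the archive sits in the UHR-IonStar folder; the folder path and the helper `missing_packages` are hypothetical, and the list of dependencies shown is only an example (the authoritative list is in ‘Setup.R’).

```r
# Report which packages from a list are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Hypothetical location of the UHR-IonStar folder; adjust to your system.
app_dir <- "path/to/UHR-IonStar"
tarball <- file.path(app_dir, "IonStarStat_0.1.4.tar.gz")

if (file.exists(tarball)) {
  # repos = NULL makes install.packages() install from the local file.
  install.packages(tarball, repos = NULL, type = "source")
} else {
  message("Package archive not found: ", tarball)
}

# "shiny" is included because the tool is a Shiny web app; extend the vector
# with any package named in an error message.
print(missing_packages(c("stats", "shiny")))
```

Any package name this prints can then be passed to `install.packages()` before starting the app.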
After successfully installing all the packages, open ‘ui.R’ or ‘server.R’ and click ‘Run App’ at the top of the script window to start.
Occasionally the app may still fail with an error because a required package cannot be found, even after all the packages mentioned above have been installed. In that case, install the package named in the console or on the pop-up web page, and repeat until the UHR-IonStar interface appears. If everything works, the web browser will look like the following figure:
In the spectrum report, the rawfile name column (sp_col[1]) should only contain the file name with no extension (e.g. II_B03_21_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B), and the MS2 scan number should be numeric (e.g. 58143).
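These two formatting requirements can be checked, and partly fixed, before uploading the spectrum report. The following is a minimal sketch on a toy table; the column names `RawFile` and `ScanNum` are hypothetical and should be replaced by the columns of your own report.

```r
# Toy spectrum report with typical formatting problems.
sp <- data.frame(
  RawFile = c("run_A1.raw", "C:/data/run_B1.raw", "run_C1"),
  ScanNum = c("58143", "1027", "scan=88"),
  stringsAsFactors = FALSE
)

# Strip the directory and the file extension from the rawfile names.
sp$RawFile <- tools::file_path_sans_ext(basename(sp$RawFile))

# Convert scan numbers; entries that are not plain numbers become NA.
scan <- suppressWarnings(as.numeric(sp$ScanNum))
bad_rows <- which(is.na(scan))
sp$ScanNum <- scan

print(sp$RawFile)
print(bad_rows)  # rows whose scan number needs manual attention
```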
Click ‘IonStarStat’, then upload your files and set the options to generate the annotated frame list and the sample list, both of which are required for subsequent protein quantification. The generated annotated frame list (.csv) consists of the Protein accession number, Peptide sequence, Frame ID, and the corresponding quantitative values in each sample.
Before running this part, please download the previously generated files and modify the sample list so that each sample is assigned a GroupID. A GroupID can be any combination of alphabetic and numeric characters, e.g. A, Group1, 088714. Several data processing steps are embedded in this part and are not shown in the interface. For more details, see the code explanation in the supplementary part.
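When the group is encoded in the rawfile name, the GroupID column can be filled in with a script instead of by hand. The following is a minimal sketch that assumes the naming scheme of the example data set, where the group letter follows "ecoli_"; the pattern is an assumption and has to be adapted to other naming schemes.

```r
# Sample list with rawfile names in the style of the example data set.
samples <- data.frame(
  RawFiles = c(
    "II_B03_21_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B",
    "II_B03_02_150304_human_ecoli_B_3ul_3um_column_95_HCD_OT_2hrs_30B_9B",
    "II_B03_06_150304_human_ecoli_E_3ul_3um_column_95_HCD_OT_2hrs_30B_9B"
  ),
  stringsAsFactors = FALSE
)

# Assumption: the group label is the token that follows "ecoli_".
samples$GroupID <- sub("^.*_ecoli_([A-Za-z0-9]+)_.*$", "\\1", samples$RawFiles)

# sub() returns non-matching names unchanged, so flag them explicitly
# rather than silently using a whole file name as a group.
unmatched <- samples$GroupID == samples$RawFiles
print(samples$GroupID)
print(any(unmatched))
```

The completed table can be saved with `write.csv(samples, file, row.names = FALSE)` before it is uploaded again.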
Then re-upload the modified file into the web app:
For further data processing, the column “PepNum” in the quantitative result dataset needs to be removed. Also, if the row names contain many characters, we recommend renaming them to shorter ones for convenience.
You also need to change the “Rawfiles” entries of the Sample ID file (group file) so that they correspond to the row names in the quantitative dataset. The order can be different. Note that the modified group file should not have a number column in front.
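These edits can be made in R rather than in a spreadsheet. The following is a minimal sketch on toy data, under the assumption that the names to shorten are the sample names in the header of the quantitative result; all file contents and the short names are hypothetical.

```r
# Toy quantitative result: a PepNum column plus long sample names.
quan <- data.frame(
  PepNum = c(4, 12),
  II_B03_21_human_ecoli_A = c(26.6, 29.1),
  II_B03_02_human_ecoli_B = c(26.7, 29.2),
  row.names = c("A0AVT1:UBA6_HUMAN", "A0FGR8:ESYT2_HUMAN")
)
# Toy group file, deliberately in a different order.
groups <- data.frame(
  RawFiles = c("II_B03_02_human_ecoli_B", "II_B03_21_human_ecoli_A"),
  GroupID = c("B", "A"),
  stringsAsFactors = FALSE
)

# Drop the PepNum column.
quan$PepNum <- NULL

# Hypothetical short names, keyed by the original long names.
short <- c(II_B03_21_human_ecoli_A = "A1", II_B03_02_human_ecoli_B = "B1")

# Apply the same renaming to both files so that they stay consistent.
groups$RawFiles <- unname(short[groups$RawFiles])
colnames(quan) <- unname(short[colnames(quan)])

# Every sample in the result must appear in the group file.
stopifnot(setequal(colnames(quan), groups$RawFiles))
print(colnames(quan))
print(groups)
```

Saving the group file with `write.csv(groups, file, row.names = FALSE)` avoids the leading number column.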
Note that the names in the first row have some restrictions: they cannot contain symbols (except "_") and cannot start with a number (e.g. "9hTreat" is not allowed). For protein results, the first column can have the format “ProteinAccession:ProteinID” or contain just one of the two. For peptide results, the first column can be “ProteinAccession:ProteinID|PeptideSequence”, or one of the first two followed by the sequence, separated by "|". Note that if the datasets contain only partial protein or peptide names, you cannot perform results verification in the last part.
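The naming rules above can be tested before upload. The following is a minimal sketch; both helper functions are hypothetical, and the exact rule enforced by the app may differ (here a valid name is assumed to start with a letter and to contain only letters, digits, and "_").

```r
# Assumed rule: start with a letter, then letters, digits, or "_" only.
is_valid_name <- function(x) grepl("^[A-Za-z][A-Za-z0-9_]*$", x)

# Split "ProteinAccession:ProteinID|PeptideSequence" into its parts.
# Missing parts are returned as NA; when there is no ":", the single
# protein token is reported in the accession column.
parse_feature_name <- function(x) {
  has_pep <- grepl("|", x, fixed = TRUE)
  protein <- sub("\\|.*$", "", x)
  has_id <- grepl(":", protein, fixed = TRUE)
  data.frame(
    accession = sub(":.*$", "", protein),
    protein_id = ifelse(has_id, sub("^[^:]*:", "", protein), NA_character_),
    peptide = ifelse(has_pep, sub("^[^|]*\\|", "", x), NA_character_),
    stringsAsFactors = FALSE
  )
}

valid <- is_valid_name(c("Treat9h", "9hTreat", "Group_1", "Group-1"))
parsed <- parse_feature_name(c("P0C8J6:GATY_ECOLI|INVATELK",
                               "P0C8J6:GATY_ECOLI"))
print(valid)
print(parsed)
```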
This part focuses on data analysis of the quantitative results. It covers basic statistical testing, missing data removal, and decoy removal for a case-control study design.
Missing data are removed by a simple threshold: a protein is removed if the sum of its log intensities in every condition is less than 4. This setting is not shown in the interface.
“Decoys” are peptide sequences identified from the reversed database. They are false positives, because these sequences should not exist yet were detected. Decoys are essential for calculating false discovery rates but are of no use for quantification, so they are removed according to a name pattern. In the example, all decoys are removed because their names are known to contain ‘XXX’ in our reversed database.
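Decoy removal by name pattern can be sketched in a few lines of base R. This is an illustration on toy data, not the app's own code; the way the ‘XXX’ tag is attached to the protein name here is hypothetical.

```r
# Toy protein table; the second entry mimics a hit from the reversed database.
quan <- data.frame(
  A1 = c(26.6, 24.1, 29.1),
  B1 = c(26.7, 24.3, 29.2),
  row.names = c("A0AVT1:UBA6_HUMAN", "XXX_P0C8J6:GATY_ECOLI",
                "A0FGR8:ESYT2_HUMAN")
)

# Decoy tag used in the example above; other databases may use another tag.
decoy_tag <- "XXX"

# fixed = TRUE treats the tag as literal text rather than a regular expression.
is_decoy <- grepl(decoy_tag, rownames(quan), fixed = TRUE)
targets <- quan[!is_decoy, , drop = FALSE]

print(sum(is_decoy))
print(rownames(targets))
```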
After the data processing step is finished, all results are saved in the system, so you can go directly to the data visualization part without uploading anything.
This section visualizes the data in different ways for different purposes. In total, six kinds of graph can be drawn to convey systematic information about the quantification results:
Intra-group CV box plot: This plot shows the coefficient of variation for each group.
Intensity curve plot: This plot shows the average protein intensity of every subject, sorted in decreasing order.
Inter-group correlation plot: This plot reveals the correlation between two groups. Before plotting, select the two groups whose correlation you want to see.
Pearson Correlation Matrix plot: This plot conveys the correlation between each pair of observation groups.
Plot for Principal Components Analysis: Principal Components Analysis (PCA) is used to cluster the data at the observation group level. Samples from the same observation group are shown in the same color, and you can see the relationship among all groups based on the first two principal components.
Ratio Distribution plot: After choosing which groups to show (multiple selection is available), you can see the ratio distribution (selected group divided by the control group). If ‘Correction’ is chosen, the curve is centered at the location of its maximum density.
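Two of the quantities behind these plots, the intra-group CV and the ‘Correction’ of the ratio distribution, can be sketched in base R. This is an illustration under stated assumptions rather than the app's own code: CVs are computed on linear-scale intensities, ratios are handled as log2 values, and both function names are hypothetical.

```r
# Coefficient of variation per protein within each group.
# 'mat' holds linear-scale intensities (rows = proteins, columns = samples).
group_cv <- function(mat, groups) {
  sapply(split(seq_len(ncol(mat)), groups), function(idx) {
    sub_mat <- mat[, idx, drop = FALSE]
    apply(sub_mat, 1, sd) / rowMeans(sub_mat)
  })
}

# Shift log2 ratios so that the highest point of their density sits at zero.
center_at_mode <- function(log_ratio) {
  d <- density(log_ratio, na.rm = TRUE)
  log_ratio - d$x[which.max(d$y)]
}

mat <- rbind(p1 = c(90, 100, 110, 200, 200, 200),
             p2 = c(50, 50, 50, 80, 100, 120))
groups <- c("A", "A", "A", "B", "B", "B")
cv <- group_cv(mat, groups)
print(cv)

set.seed(1)
log_ratio <- rnorm(2000, mean = 0.5, sd = 0.2)  # simulated ratios, biased by +0.5
corrected <- center_at_mode(log_ratio)
print(median(corrected))
```

A box plot of each column of `cv` would correspond to the intra-group CV plot, and a density plot of `corrected` to the ratio distribution after correction.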
It is crucial to narrow the scope from the system-level data down to specific potential biomarkers.
Statistical testing identifies the proteins or peptides with significant changes. Four plots are then available to show them quantitatively and functionally:
Volcano Plot: shows up-regulated, down-regulated, and unchanged proteins in different colors.
Intensity curve with significantly changed proteins marked
Up-regulated and Down-regulated protein boxplot: first set how many proteins are shown in the plot.
Gene Ontology: draws a graph of the biological processes involving the significantly changed proteins. We recommend using the Protein Accession Number as the ID type for each protein. This plot may fail with an error if the number of selected proteins is too small.
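The up/down/unchanged classification shown in a volcano plot can be sketched as follows. This is an illustration only: the test (Welch two-sample t-test), the cutoffs (|log2 fold change| of at least 1, p below 0.05), and the function name are assumptions, and the app may use different settings.

```r
# Classify proteins from log2 intensities (rows = proteins, columns = samples).
classify_changes <- function(log_mat, groups, case, control,
                             lfc_cut = 1, p_cut = 0.05) {
  case_mat <- log_mat[, groups == case, drop = FALSE]
  ctrl_mat <- log_mat[, groups == control, drop = FALSE]
  # On the log2 scale, the fold change is a difference of group means.
  lfc <- rowMeans(case_mat) - rowMeans(ctrl_mat)
  p <- vapply(seq_len(nrow(log_mat)), function(i) {
    t.test(case_mat[i, ], ctrl_mat[i, ])$p.value
  }, numeric(1))
  status <- ifelse(p < p_cut & lfc >= lfc_cut, "up",
                   ifelse(p < p_cut & lfc <= -lfc_cut, "down", "unchanged"))
  data.frame(log2FC = lfc, p.value = p, status = unname(status),
             row.names = rownames(log_mat), stringsAsFactors = FALSE)
}

log_mat <- rbind(
  up_protein   = c(20.0, 20.1, 19.9, 22.0, 22.1, 21.9),
  down_protein = c(25.0, 25.2, 24.8, 23.0, 23.1, 22.9),
  flat_protein = c(18.0, 18.3, 17.9, 18.1, 17.8, 18.2)
)
groups <- c("ctrl", "ctrl", "ctrl", "case", "case", "case")
res <- classify_changes(log_mat, groups, case = "case", control = "ctrl")
print(res)
```

Plotting `log2FC` against `-log10(p.value)` and coloring by `status` gives the familiar volcano layout.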
This part focuses on double-checking the quantification results by combining protein and peptide results. The quantity of a protein is related to its peptide components. Sometimes, however, there is a large intensity difference between a protein and its constituent peptides, which may indicate unreliable data.
Users can run the “Data Processing” part for the protein and peptide lists separately, then come to the “Re-verification” part to integrate and compare the results.
This part works by matching protein accession numbers between the two lists. Before running it, users should therefore adjust the list format slightly:
The integration result is shown below:
v1.1.1:
Add ‘IonstarStat’ part.
Add the option ‘Correction’ when graphing a ratio distribution plot.
v1.1.2:
Relax the limitation that each group must have the same number of columns.
Replace ‘One-way ANOVA’ with ‘Pair t-test’.
Add a function to graph the Gene Ontology plot.
v1.2:
Quantification of peptides is now available.
Fixed a bug that prevented choosing the plot size in the Biomarker Discovery section.
v1.3:
The down-regulated plot now works normally.
Improve some interface details.
For questions, suggestions, and other topics about UHR-IonStar, feel free to contact us:
Shuo Qian: sqian@buffalo.edu
Shichen Shen: shichens@buffalo.edu
Xue Wang: xwang79@buffalo.edu
Jun Qu: junqu@buffalo.edu
1. Generate the annotated frame list
#Load IonStarStat
library("IonStarStat")
##Generate the annotated frame list
db <- "IonStarPRIDE_database.sdb" ##File name of the SIEVE database
sp <- "IonStarPRIDE_spectrum report.csv" ##File name of the spectrum report
col_filename <- 4 ##Column number for rawfile name
col_scannum <- 17 ##Column number for MS2 scan number
col_framelist <- c(6,18) ##Column numbers for Protein accession number and Peptide sequence
framelist <- "IonStarPRIDE_frame.csv" ##File name of the annotated frame list (output1)
sampleid <- "IonStarPRIDE_sampleid.csv" ##File name of the sample list (output2)
source("IonStar_FrameGen.R")
The generated annotated frame list (.csv) consists of the Protein accession number, Peptide sequence, Frame ID, and the corresponding quantitative values in each sample, as shown below.
## ProteinAC PepSeq FrameID A1 B1 C1
## 1 Q96I51:WBS16_HUMAN EAAEAEAEVPVVQYVGER 35199 5452494 475886.4 1391912
## 2 P0C8J6:GATY_ECOLI INVATELK 11407 216541745 262224407.0 342242173
## 3 P0C8J6:GATY_ECOLI NYLTEHPEATDPR 6302 50365797 52927424.6 60233419
## 4 P0C8J6:GATY_ECOLI QWVNLPLVLHGASGLSTK 47084 13635817 18975922.8 21041386
## 5 P0C8J6:GATY_ECOLI QWVNLPLVLHGASGLSTK 85743 158317760 128711136.1 113131881
## 6 P0C8J6:GATY_ECOLI SVMIDASHLPFAQNISR 41490 47259460 47781835.2 55798367
## D1 E1 E2 D2 C2 B2 A2
## 1 2289193 1302592 1146268 1645735 1091675 2789563 660561.7
## 2 411246895 481214021 451974788 394233403 304898893 251107764 175357014.9
## 3 74382575 92162468 94188976 75480174 50618895 41996791 28891661.5
## 4 31074728 39476379 37953565 28216572 22198631 13746699 8390298.0
## 5 116859175 113338204 107014155 112368481 114584694 118152103 121447044.0
## 6 69234723 83597655 86098121 61897136 53934485 40421413 28736591.6
## A3 B3 C3 D3 E3 E4 D4
## 1 1583813 5148398 1923703 3842454 4221733 3554227 3582159
## 2 207088734 254629839 330040590 391575868 494867018 491473921 495393627
## 3 52817845 76774606 62542501 131993194 166741064 171442525 144341093
## 4 12464382 29358146 21184939 50791891 66393810 63215742 48868329
## 5 155394791 156556630 115749102 144621808 136891875 141464442 140116613
## 6 48176223 69299555 52708139 116873607 142661980 134440236 110236215
## C4 B4 A4
## 1 4278508 5723243 3780455
## 2 389784097 318133056 216134657
## 3 110284315 81203507 59372919
## 4 35728182 22188573 13603490
## 5 143775988 150280000 149897829
## 6 85939761 63125621 46183524
2. Perform protein quantification
## RawFiles GroupID
## 1 II_B03_21_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B A
## 2 II_B03_02_150304_human_ecoli_B_3ul_3um_column_95_HCD_OT_2hrs_30B_9B B
## 3 II_B03_03_150304_human_ecoli_C_3ul_3um_column_95_HCD_OT_2hrs_30B_9B C
## 4 II_B03_04_150304_human_ecoli_D_3ul_3um_column_95_HCD_OT_2hrs_30B_9B D
## 5 II_B03_05_150304_human_ecoli_E_3ul_3um_column_95_HCD_OT_2hrs_30B_9B E
## 6 II_B03_06_150304_human_ecoli_E_3ul_3um_column_95_HCD_OT_2hrs_30B_9B E
## 7 II_B03_07_150304_human_ecoli_D_3ul_3um_column_95_HCD_OT_2hrs_30B_9B D
## 8 II_B03_08_150304_human_ecoli_C_3ul_3um_column_95_HCD_OT_2hrs_30B_9B C
## 9 II_B03_09_150304_human_ecoli_B_3ul_3um_column_95_HCD_OT_2hrs_30B_9B B
## 10 II_B03_10_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B A
## 11 II_B03_11_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B A
## 12 II_B03_12_150304_human_ecoli_B_3ul_3um_column_95_HCD_OT_2hrs_30B_9B B
## 13 II_B03_13_150304_human_ecoli_C_3ul_3um_column_95_HCD_OT_2hrs_30B_9B C
## 14 II_B03_14_150304_human_ecoli_D_3ul_3um_column_95_HCD_OT_2hrs_30B_9B D
## 15 II_B03_15_150304_human_ecoli_E_3ul_3um_column_95_HCD_OT_2hrs_30B_9B E
## 16 II_B03_16_150304_human_ecoli_E_3ul_3um_column_95_HCD_OT_2hrs_30B_9B E
## 17 II_B03_17_150304_human_ecoli_D_3ul_3um_column_95_HCD_OT_2hrs_30B_9B D
## 18 II_B03_18_150304_human_ecoli_C_3ul_3um_column_95_HCD_OT_2hrs_30B_9B C
## 19 II_B03_19_150304_human_ecoli_B_3ul_3um_column_95_HCD_OT_2hrs_30B_9B B
## 20 II_B03_20_150304_human_ecoli_A_3ul_3um_column_95_HCD_OT_2hrs_30B_9B A
Make sure to load IonStarStat with library("IonStarStat"). Read the annotated frame list and the grouped sample list into the R environment.
rawfile <- "IonStarPRIDE_Frame.csv"
condfile <- "IonStarPRIDE_Groups.csv"
raw <- read.csv(rawfile)
cond <- read.csv(condfile)
condition <- cond[match(colnames(raw)[-c(1:3)], cond[,1]),2]
condition
## [1] A B C D E E D C B A A B C D E E D C B A
## Levels: A B C D E
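The match() call above aligns the group labels with the order of the sample columns in the frame list, so the two files do not need to list the samples in the same order. A minimal sketch on toy data (the sample names are hypothetical) shows the idea and adds a check for samples missing from the sample list:

```r
# Sample columns as they appear in the frame list (after the three
# annotation columns), and a sample list given in a different order.
sample_cols <- c("A1", "B1", "C1")
cond <- data.frame(RawFiles = c("C1", "A1", "B1"),
                   GroupID = c("C", "A", "B"),
                   stringsAsFactors = FALSE)

# match() returns, for each sample column, its row in the sample list.
idx <- match(sample_cols, cond$RawFiles)
if (anyNA(idx)) {
  stop("Samples missing from the sample list: ",
       paste(sample_cols[is.na(idx)], collapse = ", "))
}
condition <- factor(cond$GroupID[idx])
print(idx)
print(condition)
```

Without such a check, an unmatched sample name yields NA in `condition`, which can cause confusing errors in later steps.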
Use newProDataSet to remove redundant frames (i.e. frames assigned to multiple peptide sequences), which cause ambiguity in quantification.
pdata <- newProDataSet(proData=raw, condition=condition)
The number of proteins before and after removal, as well as the number of redundant frames removed, will be reported in the console.
## Input 3886 proteins.
## 6489 duplicated frames founded.
## 3873 proteins left after filtering.
Use pnormalize to perform inter-sample normalization of quantitative intensities. Aggregation of frame data to peptide data can be done by summarize=TRUE. Normalization can be based on either total ion intensities (method="TIC") or quantiles (method="quantiles") in each sample. Use method=NULL to skip normalization.
ndata <- pnormalize(pdata, summarize=TRUE, method="TIC")
Boxplots of peptide quantitative data before (left) and after (right) normalization are shown as follows.
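As an illustration of what normalization by total ion intensity does, the following minimal sketch scales every sample so that its summed intensity equals the mean total across samples. It is a generic version of the idea and not the implementation inside pnormalize; the function name `tic_normalize` is hypothetical, and linear-scale intensities are assumed.

```r
# Scale each sample (column) so that its total intensity equals the mean
# total across all samples.
tic_normalize <- function(mat) {
  totals <- colSums(mat, na.rm = TRUE)
  sweep(mat, 2, mean(totals) / totals, FUN = "*")
}

# Toy data: three samples with the same composition but different loading.
mat <- cbind(S1 = c(100, 300), S2 = c(200, 600), S3 = c(50, 150))
norm <- tic_normalize(mat)
print(colSums(norm))
print(norm)
```

After scaling, all column totals are equal, while the relative abundances within each sample are unchanged.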
Use OutlierPeptideRM to perform outlier peptide detection. IonStar uses Principal Component-based Outlier Detection (PCOut) for outlier detection, which is tailored for multi-condition comparison (at least 3 conditions including control).
The parameter variance (0.7 to 0.9) can be adjusted according to the stringency needed for outlier detection. The higher the value, the more outliers will be rejected.
cdata <- OutlierPeptideRM(ndata, condition, variance=0.7, critM1=1/3, critM2=1/4, ratio=TRUE)
## 6049 outliers were removed; 21937 peptides left after outlier removal.
For case-control comparison, set the parameter ratio=FALSE. Alternatively, Grubbs' test can be used for outlier rejection, which will be available in the next build of IonStarStat.
Use SharedPeptideRM to remove shared peptides (i.e. peptides inferred to multiple unique protein groups, a.k.a. degenerate peptides). This step is optional, as many highly abundant proteins share a large proportion of homologous sequence domains, and removal of these peptides could be counterproductive for quantification. However, in specific cases, such as quantification of mixed-species samples, removal of shared peptides with species ambiguity is necessary to obtain species-specific quantitative results.
#Optional removal of shared peptides
cdata <- SharedPeptideRM(cdata)
Use ProteinQuan to aggregate peptide-level quantitative data to protein level. Both sum intensities (method="sum") and General Linear Mixed Model (method="fit") can be used for peptide-to-protein aggregation.
quan <- ProteinQuan(eset=cdata, method="sum")
## PepNum A1 B1 C1 D1 E1 E2
## A0AVT1:UBA6_HUMAN 4 26.62118 26.70643 26.70311 26.55632 26.56956 26.50505
## A0FGR8:ESYT2_HUMAN 12 29.14639 29.14287 29.19418 29.11159 29.07409 29.04383
## A0MZ66:SHOT1_HUMAN 8 27.38884 27.12556 27.21083 27.11330 27.08704 26.95478
## A1L0T0:ILVBL_HUMAN 4 24.82774 25.34471 25.23324 25.22633 25.29648 25.25879
## A1X283:SPD2B_HUMAN 4 25.91957 25.89851 25.98069 25.75741 25.62000 25.52778
## A2RRP1:NBAS_HUMAN 2 23.21671 23.42673 23.27803 23.06164 22.72570 22.98854
## D2 C2 B2 A2 A3 B3
## A0AVT1:UBA6_HUMAN 26.59699 26.71673 26.65676 26.77142 26.80597 26.63028
## A0FGR8:ESYT2_HUMAN 29.11444 29.18659 29.18904 29.28187 29.19237 29.11881
## A0MZ66:SHOT1_HUMAN 27.11314 27.28457 27.07045 27.16919 27.32022 27.35312
## A1L0T0:ILVBL_HUMAN 25.39263 25.29545 25.22623 25.41492 25.25910 24.90396
## A1X283:SPD2B_HUMAN 25.70657 25.94787 25.87455 25.82754 26.14511 25.99702
## A2RRP1:NBAS_HUMAN 22.89805 23.27723 23.18188 23.39038 23.13157 23.36878
## C3 D3 E3 E4 D4 C4
## A0AVT1:UBA6_HUMAN 26.64539 26.37873 26.50122 26.37168 26.58727 26.50536
## A0FGR8:ESYT2_HUMAN 29.20611 28.87648 28.95956 28.84509 28.92555 28.92343
## A0MZ66:SHOT1_HUMAN 27.22668 27.28598 27.35093 27.21666 27.19223 27.25682
## A1L0T0:ILVBL_HUMAN 25.21406 24.71122 24.80994 24.84718 24.86370 24.78118
## A1X283:SPD2B_HUMAN 25.71892 25.62937 25.82062 25.72596 25.88044 25.80138
## A2RRP1:NBAS_HUMAN 23.10810 23.13342 22.86591 22.97998 23.19907 22.89899
## B4 A4
## A0AVT1:UBA6_HUMAN 26.60836 26.72253
## A0FGR8:ESYT2_HUMAN 29.11381 29.17440
## A0MZ66:SHOT1_HUMAN 27.33590 27.39389
## A1L0T0:ILVBL_HUMAN 24.96034 25.14340
## A1X283:SPD2B_HUMAN 25.88904 26.01037
## A2RRP1:NBAS_HUMAN 23.26405 23.07043
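The idea behind sum-based aggregation can be sketched in base R: peptide intensities belonging to the same protein are added up per sample. This is an illustration on toy data, not the code inside ProteinQuan; it assumes linear-scale peptide intensities and that the protein values shown above are on a log2 scale.

```r
# Toy peptide table on the linear intensity scale.
pep <- data.frame(
  protein = c("P1", "P1", "P2"),
  A1 = c(1e6, 3e6, 8e6),
  B1 = c(2e6, 2e6, 4e6),
  stringsAsFactors = FALSE
)

# Sum the peptide intensities per protein, then log2-transform.
prot_sum <- rowsum(pep[, c("A1", "B1")], group = pep$protein)
prot_log2 <- log2(as.matrix(prot_sum))

# Number of peptides per protein, analogous to the PepNum column.
pep_num <- table(pep$protein)

print(prot_log2)
print(pep_num)
```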
Users can export both peptide and protein quantitative results by write.csv.
write.csv(quan,"IonStarPRIDE_protein_quan.csv")
write.csv(exprs(cdata),"IonStarPRIDE_peptide_quan.csv")